ABSTRACT



This paper presents a statistically sound method for measur- 
ing the accuracy with which a probabilistic model reflects 
the growth of a network, and a method for optimising pa- 
rameters in such a model. The technique is data-driven, and 
can be used for the modeling and simulation of any kind of 
evolving network. 

The overall framework, a Framework for Evolving Topol- 
ogy Analysis (FETA), is tested on data sets collected from 
the Internet AS-level topology, social networking websites 
and a co-authorship network. Statistical models of the growth 
of these networks are produced and tested using a likelihood- 
based method. The models are then used to generate arti- 
ficial topologies with the same statistical properties as the 
originals. This work can be used to predict future growth 
patterns for a known network, or to generate artificial mod- 
els of graph topology evolution for simulation purposes. Par- 
ticular application examples include strategic network plan- 
ning, user profiling in social networks or infrastructure de- 
ployment in managed overlay-based services. 

Categories and Subject Descriptors 

C.2.1 [Network Architecture and Design]: Network Topology;
G.2.2 [Graph Theory]: Network Problems

General Terms 

Measurement, Design 

Keywords 

Network evolution, Likelihood-based models 

1. INTRODUCTION 



In recent years there has been much interest in cre- 
ating simple probabilistic models which can be used 
to produce topologies which replicate certain statisti- 
cal properties of a given target network. Many of these 
models depend on a procedure by which a network is 
progressively "grown" from a small "seed" (with a hand- 
ful of links) into an artificial topology which is as large 
as required. If the model is successful, the artificial net- 
work will have similar properties to the original. Thus, 
these models rely on finding a network evolution model 
that produces networks which are structurally similar 
to the target network. 

In much of the previous research in this field the 
usual way of achieving this is to hypothesise an evo- 
lution model for the target network, grow an artificial 
network of at least the same size using that model, and 
compare several key graph theoretical statistics with the 
respective ones from the target network. This is usu- 
ally done multiple times, so that the expected values 
of the statistics can be obtained. If after this process 
the model is found to be unsatisfactory, it is updated 
accordingly and the whole process repeated. 

Thus, the development of a topology evolution model 
following the methodology detailed above will require 
the construction of large numbers of "test" topologies 
that use tentative evolution models. Since the construc- 
tion of these topologies can be computationally cumber- 
some if the networks in question are large, the analy- 
sis of network evolution models has not been widely 
adopted in practice. 

The main contribution of this work is two-fold. Firstly 
(and most importantly), we present a set of statistics 
that directly measure the likelihood of a given prob- 
abilistic evolution model giving rise to a given target 
network - no "test" topologies need to be constructed. 
Secondly, we present a framework for exploratory testing
and optimisation of certain (quite general) classes
of network growth models.

The statistics that our technique produces are an un- 
ambiguous and statistically rigorous measure of the like- 
lihood of the evolution of the target network arising 
from any particular hypothesised probabilistic model. 
These statistics are quick to produce (much more so
than growing a test network of the same size), and could
be used as a fitness function for state-space searches or
genetic algorithms to automatically optimise parametrised
classes of models.

We will refer to this statistical framework as FETA 
(Framework for Evolving Topology Analysis). 

The structure of this paper is as follows. Section 2
describes the FETA framework in detail. Section 2.1
shows how the model likelihood is derived, and section
2.2 describes the fitting procedure for optimising model
parameters. Section 3 describes the model fitting for
the five network examples investigated in this paper.
Section 4 shows how the fitted models can be used to
generate artificial topologies that replicate specific
statistical measures of the corresponding real networks.

1.1 Motivation 

The problem of creating artificial topologies with the 
same growth dynamics as a target network is an impor- 
tant one. As networks grow, their statistical measures 
change and undesirable emergent properties may occur. 
A good statistical model of how a given target network 
grows is an important goal which has applications in 
many fields, but especially in the design and optimi- 
sation of distributed computation and communication 
systems. A tier one network provider may wish to be 
able to model the future growth of the AS network to 
predict and potentially avoid undesirable network prop- 
erties, or to strategically choose its peering agreements. 
The owner of an online social network may wish to
predict, from users' positions in the network, their
consumption patterns or demographic characteristics,
which users are more likely to accrue "friends" and
hence influence others. This information can be used
for targeted advertising, marketing or capacity planning
purposes. Finally, a provider of overlay-based services
(such as Skype or Akamai) may need to plan based 
upon the future evolution of their overlay network. As 
key network statistics change, they may wish to adapt 
their protocols, or to modify infrastructure deployment 
strategies accordingly. 

1.2 Background 

The field of generating graphs (or networks or topologies,
the words seem to be used almost interchangeably
in the literature) using random processes is usually
considered to begin with Erdős and Rényi [8]. An
early study by Price [7] found that the degree distribution
of a co-authorship network of scientific papers obeyed
a power law. Much later it was discovered that the
Internet Autonomous System (AS) graph also follows
a power law [9] and this finding was also shown to
apply to a large number of other networks, including
social networks, hyperlinked document networks and
networks derived from biological systems. The well-known
Barabási-Albert (BA) model [3] provided a seminal
explanation of scaling network topologies in terms
of a "preferential attachment" model where the "rich get
richer": the probability of connecting to a given node
is exactly proportional to its degree. This led to several
papers which attempt to explain network evolution in
terms of node degree and related properties, such as the
BA [3], ASIM [TT] and AB [2] models.

Bu and Towsley [5] introduced the Generalised Linear 
Preference (GLP) model, which modifies the preferen- 
tial attachment model by raising the degree of the node 
to a small power. Zhou and Mondragón [19] presented
the Positive-Feedback Preference (PFP) model, which 
also modifies preferential attachment by raising node 
degree to a small power (but this power also depends 
on the node degree). 

It has been shown that a model which faithfully re- 
produces the node degree distribution may not capture 
all the important properties of a graph [TB] . To account 
for this, the ORBIS model T2^ reproduces the statistics 
of subgraphs of small orders to take account of degree- 
degree and higher order characteristics. The ORBIS 
model is slightly different to the growth models which 
the FETA approach uses, as the model uses rewiring 
and rescaling rather than a hypothesised growth model. 

The typical assessment of topology generation mod- 
els has focused on measuring a number of statistics on 
the real network data. Such measures have included the 
number of nodes and links, average and maximum node 
degree, best-fit power-law exponent, rich-club connectivity,
probability of nodes with low degrees (1, 2 and
3), characteristic path length, average and maximum
triangle coefficient, average and maximum quadrangle
coefficient, average $k_{nn}$ and average and maximum
betweenness (see [10] for definitions of these properties
and a review of topology generation from an Internet
perspective). A candidate artificial model is then tested
by creating an artificial topology using the model and 
seeing how well the topology reproduces several statis- 
tics measured on the real data set. Occasionally, a new 
network statistic may be added which existing models 
do not reproduce and this can be used to justify a differ- 
ent, improved model. This approach to model testing 
and refining based on the generation of test topologies 
and the comparison of a set of statistical measures be- 
tween the test topologies and the target network is char- 
acterised here as the "basket of statistics" approach. 
Figure 1: The FETA approach compared with the "basket of statistics" approach.



Willinger et al [TH] called for a "closing of the loop" 
between the discovery of "emergent phenomena" and 
the models which reproduce them. They emphasise 
the importance of a "validation" step to ensure that 
a particular proposed model is consistent with the real 
data. The evaluation framework given in section 2.1
provides this validation, albeit at the expense of requir- 
ing data about how the network evolves rather than a 
static snapshot. In so doing, however, the generation of 
test topologies is avoided and the processing of bigger 
graphs or more complicated models becomes possible. 

Figure 1 contrasts the approach used by FETA with
the "basket of statistics" approach which has been pre- 
viously used. By directly assessing model likelihood, 
our approach short circuits the cycle of generating and 
measuring test networks to optimise a test model. 

2. FRAMEWORK FOR EVOLVING TOPOLOGY ANALYSIS

The Framework for Evolving Topology Analysis (FETA) 
allows the investigation of growth models for real net- 
works where information is known about the order in 
which links were added to the network. The aim of 
FETA is to produce probabilistic models which fit the 
observed evolution of these networks. The class of models
which FETA can work with includes BA [3], AB [2],
GLP [5] and PFP [19].

The probabilistic models used by FETA are described 
in terms of two components referred to as an inner 
model and an outer model. It is the inner model which 
FETA evaluates and fits, which means that only models 
based on probabilistically growing networks are com- 
patible with FETA. 

Definition 1. The outer model chooses an opera- 
tion which will make a change to the existing network. 
This could be add and connect a node, add a link between 
nodes, delete a node or delete a link between nodes. The 
inner model defines the probabilities for selecting the 
node or nodes involved in the operation. 

Definition 2. The inner model defines the probabilities
for selecting the node or nodes involved in the
operation defined by the outer model. The inner model
used by a given evolution model can vary, depending on
whether the outer model operation is a connection to a
new node or a connection between existing inner nodes.

For example, the AB model would correspond to an 
outer model which adds a new node and then chooses 
exactly three inner nodes to connect it to, along with 
an inner model that chooses nodes with probabilities 
proportional to their node degree. In PFP and GLP, 
the outer model is the same as with AB, but the inner 
model is different: the probability is proportional to the 
degree raised to a power. 

Remark 1. As is common in the literature, the main 
focus of this paper is on the inner model. The outer 
models, where needed in this paper, are assumed to be 
of the following simple form: a new node is joined to 
$N$ existing nodes; following this, $M$ (which can be zero)
inner edges are added. The values of $N$ and $M$ are randomly
selected from probability distributions empirically
derived from the target network.

The framework is flexible enough to allow or disallow 
the possibility of multiple edges between the same node 
pair, and to allow or disallow nodes with connections to 
themselves. In this paper only undirected, connected, 
simple (no repeated edges and no edges from a node 
to itself) graphs are considered. Removal of nodes or 
edges is not considered in this paper. 

Section 2.1 describes the evaluation procedure of FETA,
which works with any inner model that can assign probabilities
to nodes or edges in a graph. It will produce
statistically reliable measures of how well the model fits
the observed data. Section 2.2 describes a fitting procedure
which works with a subset of inner models which
combine sub-models and fit them using Generalised Linear
Models. Section 2.3 describes how FETA is used in
practice and gives information about scalability.

2.1 Model evaluation using FETA 

Consider data from an empirical network which shows 
how the network grows in time by the addition of nodes 
and edges. This growth data can be decomposed into
decisions from the outer model (whether a link is be- 
tween existing inner nodes or to a new node) and the 
choices of node (which would be controlled by the inner 
model). 

Let $G_0$ be the known state of the graph at a certain
time. Assume that the graph is extended by adding
edges (sometimes between existing nodes and sometimes
in addition to a new node) one at a time. Assume, further,
that the state of the graph is known for each one of
these edge additions up to some step $t$ (that is,
$G_0, G_1, \ldots, G_t$ are known). Let $O_i$ be the outer
model operation (connect edge to new node or connect edge
between existing nodes) for the $i$th edge addition since
$G_0$. Let $I_i$ be the node or nodes selected by the inner
model for the outer model operation $O_i$. Together $O_i$
and $I_i$ define the transition between $G_{i-1}$ and $G_i$.
Conversely, if $G_{i-1}$ and $G_i$ are known then $O_i$ and
$I_i$ are also known for $1 \le i \le t$.
The best outer and inner models are those which best
explain $O_i$ and $I_i$, respectively, for the observed
periods. This paper focuses on the selection of the inner
model.

Let $C$ stand for all of the observed inner model choices
$I_1, I_2, \ldots, I_t$, and let $\theta$ be some inner model
which attempts to explain the observed inner model choices $C$
in terms of some statistical properties of the graph. At
each step $i$, $\theta$ maps graph properties (and, perhaps,
other properties, such as whether a new node or an inner
edge is being connected, or properties associated
with the node but exogenous to the network topology)
to probabilities.

In order to simplify the explanation, assume for the
remainder of this section that the outer model always
involves the choice of a single existing node to connect to
a new node. In this case $C$ is simply an ordered list
of the nodes chosen at each observed step and $I_i$ is the
node connected at step $i$. Evaluation of the model $\theta$ is
now a matter of calculating the likelihood of $C$ given
the model $\theta$. The larger this likelihood, the better the
model fits the observed data.

Let $p_i(j|\theta)$ be the probability that inner model $\theta$
assigns to node $j$ at step $i$. To be a valid model, $\theta$ should
ensure that $\sum_j p_i(j|\theta) = 1$, where the sum is over nodes.
It follows that $p_i(I_i|\theta)$ is the likelihood of the choice $I_i$
at step $i$ given model $\theta$. The likelihood of all the
observations $C$ given model $\theta$ is given by the product

$$L(C|\theta) = \prod_{i=1}^{t} p_i(I_i|\theta).$$

It is also useful to define the log likelihood
$l(C|\theta) = \log(L(C|\theta))$. The larger the likelihood (or log
likelihood) the better the model explains $C$.

Definition 3. The null model $\theta_0$ is defined as the
model which gives every node in the choice set equal
probability (this can also be thought of as the random
model). The saturated model $\theta_s$ is a model with as
many parameters as data points. In this case, $\theta_s$
ensures that $p_i(I_i|\theta_s) = 1$ for all $i \in \{1, \ldots, t\}$. Hence
$L(C|\theta_s) = 1$ and $l(C|\theta_s) = 0$.

Now it is useful to define some measures of the good- 
ness of the model using the statistic known as deviance. 

Definition 4. The deviance of model $\theta$ is minus two
times the log-likelihood ratio between the model $\theta$ and the
saturated model $\theta_s$,

$$D = -2(l(C|\theta) - l(C|\theta_s)) = -2\,l(C|\theta).$$

Evidently, the deviance will always be positive (or
zero for the saturated model), and the smaller it is, the
better the model $\theta$ explains the data.

Definition 5. The null deviance $D_0$ of a candidate
model $\theta$ is given by

$$D_0 = -2(l(C|\theta) - l(C|\theta_0)).$$

Thus, $D_0$ will always be negative if the model $\theta$
explains $C$ better than the null (random) model $\theta_0$. The
smaller $D_0$, the better $\theta$ explains the choice set $C$.

Because of the size of the data sets used in this work
($|C| \sim 100{,}000$), the magnitude of $D$ can be quite
large and depends critically on the size of $C$. It is useful
to have a statistic which is defined on a more comprehensible
scale and is invariant to the size of $C$. We present
such a new statistic, the per choice likelihood ratio:

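Definition 6. The per choice likelihood ratio $c_0$ is
the likelihood ratio between $\theta$ and the null model $\theta_0$
normalised by the number of choices,

$$c_0 = \left[\frac{L(C|\theta)}{L(C|\theta_0)}\right]^{1/t} = \exp\left[\frac{l(C|\theta) - l(C|\theta_0)}{t}\right].$$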

The quantity $c_0$ is one if $\theta$ is exactly as good as $\theta_0$,
greater than one if it is better and less than one if it
is worse. Note that $D$, $D_0$ and $c_0$ are simply different
ways of looking at the model likelihood.
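
As an illustration of how the three statistics relate, the following minimal sketch computes the deviance, the null deviance and the per choice likelihood ratio from two log likelihoods. The function name and the numbers in the usage lines are hypothetical and are not taken from the data sets in this paper.

```python
import math

def fit_statistics(loglik_model, loglik_null, num_choices):
    """Return (D, D0, c0) given l(C|theta), l(C|theta_0) and the number of choices t."""
    # The saturated model has log likelihood zero, so D reduces to -2 l(C|theta).
    deviance = -2.0 * loglik_model
    null_deviance = -2.0 * (loglik_model - loglik_null)
    # Likelihood ratio against the null model, normalised per choice.
    per_choice_ratio = math.exp((loglik_model - loglik_null) / num_choices)
    return deviance, null_deviance, per_choice_ratio

# Hypothetical case: 1000 choices, and the candidate model gives every
# observed choice twice the probability that the null model gives it.
l_null = 1000 * math.log(0.01)
l_model = 1000 * math.log(0.02)
D, D0, c0 = fit_statistics(l_model, l_null, 1000)
```

In this hypothetical case the per choice likelihood ratio is 2 regardless of the number of choices, whereas the magnitudes of both deviances grow linearly with it.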

It should be noted, though, that while a lower deviance
or a higher per choice likelihood ratio always
indicates a better fit, this alone does not mean a model
should be preferred. The saturated model $\theta_s$ gives a
perfect fit to data, but it is a useless model for practical
purposes since it can only reproduce the data it
has been given. What is needed is a trade-off between
fit to data and a parsimonious model. Adding new parameters
to a model is only good if the improvement to
the fit (reduction in $D$, increase in $c_0$) justifies the extra
parameter. One criterion would be Akaike's An Information
Criterion (AIC) [1], which is given by $A = D + 2k$
where $k$ is the number of free parameters in the model.
However, given the size that $D$ typically attains in this
modelling, this seems unlikely to prove a useful distinction.
Example 1. An example will help comprehension.
Consider an initial graph which is the two link network
consisting of nodes $\{1, 2, 3\}$ and edges $\{(1, 2), (2, 3)\}$.
The network grows by adding node 4 and link $(2, 4)$
and then node 5 and link $(2, 5)$. We assume the simple
outer model: add one node and connect it to one existing
node at every stage. The inner model must explain
$C = (2, 2)$, that is, $I_1 = 2$ and $I_2 = 2$, given $G_0$ and $G_1$. The
null model $\theta_0$ predicts equal probabilities ($1/3$ each) for
node 4 to connect to nodes 1, 2 or 3 and equal probabilities
of $1/4$ each for node 5 to connect to nodes 1 to
4. Therefore, for this $C$ and the null model the likelihoods
are $p_1(I_1|\theta_0) = 1/3$ and $p_2(I_2|\theta_0) = 1/4$. The null
likelihood is $L(C|\theta_0) = 1/12$. If, on the other hand, we
consider as $\theta$ the preferential attachment model (probability
of attachment proportional to node degree) then,
given $G_0$ the node probabilities are $(1/4, 1/2, 1/4)$ and
given $G_1$ they are $(1/6, 1/2, 1/6, 1/6)$. The likelihoods
are $p_1(I_1|\theta) = 1/2$ and $p_2(I_2|\theta) = 1/2$, giving a final
likelihood $L(C|\theta) = 1/4$. From this, deviance, null deviance
and per choice likelihood ratio can be calculated.
Naturally, real data sets will have many more choices
and many more nodes.
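
The arithmetic of this example can be checked by replaying the growth of the network step by step. The sketch below is an illustration only: the function names are hypothetical, and it assumes the simple outer model of the example (each step adds one new node joined to one existing node).

```python
from fractions import Fraction

def null_probs(degrees):
    """Null model: every existing node is equally likely."""
    n = len(degrees)
    return {v: Fraction(1, n) for v in degrees}

def preferential_probs(degrees):
    """Preferential attachment: probability proportional to node degree."""
    total = sum(degrees.values())
    return {v: Fraction(d, total) for v, d in degrees.items()}

def likelihood(initial_edges, choices, model):
    """Replay the growth; each step joins one new node to the chosen existing node."""
    degrees = {}
    for a, b in initial_edges:
        degrees[a] = degrees.get(a, 0) + 1
        degrees[b] = degrees.get(b, 0) + 1
    like = Fraction(1)
    next_node = max(degrees) + 1
    for chosen in choices:
        like *= model(degrees)[chosen]   # probability of the observed choice
        degrees[chosen] += 1             # update the graph before the next step
        degrees[next_node] = 1
        next_node += 1
    return like

edges = [(1, 2), (2, 3)]
choices = [2, 2]
L_null = likelihood(edges, choices, null_probs)
L_pa = likelihood(edges, choices, preferential_probs)
print(L_null, L_pa)  # 1/12 1/4

# Per choice likelihood ratio: (L_pa / L_null) ** (1/t) with t = 2 choices.
c0 = float(L_pa / L_null) ** (1 / len(choices))
```

Here the likelihood ratio is 3, so the per choice likelihood ratio is the square root of 3 (about 1.73), indicating that preferential attachment explains these two choices better than the null model.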

Remark 2. The selection of edges from a set of all
possible edges would present a difficult computational
problem, as the set of all possible edges increases
approximately as the square of the number of nodes. This
can be avoided by considering the probability of choosing
edge $(n_1, n_2)$ as the probability of choosing $n_1$ followed
by $n_2$, plus the probability of choosing $n_2$ followed by $n_1$
(assuming $n_1 \neq n_2$). The second choice set can be narrowed
to avoid self loops and nodes already connected to
the first node if a simple graph is desired.
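
A minimal sketch of this decomposition follows. It assumes, purely for illustration, that the second node is drawn from the same node probabilities renormalised over the narrowed choice set; the remark does not fix this detail, and the function name is hypothetical.

```python
def edge_probability(n1, n2, node_probs, neighbours):
    """Probability of adding edge (n1, n2) as two ordered node choices.

    node_probs: dict node -> probability under the inner model (sums to one).
    neighbours: dict node -> set of nodes already adjacent to it.
    """
    def ordered(first, second):
        # Narrow the second choice set: no self loops, no repeated edges.
        allowed = [v for v in node_probs
                   if v != first and v not in neighbours[first]]
        if second not in allowed:
            return 0.0
        norm = sum(node_probs[v] for v in allowed)
        if norm == 0:
            return 0.0
        # Assumption: the second choice reuses the model, renormalised.
        return node_probs[first] * node_probs[second] / norm

    return ordered(n1, n2) + ordered(n2, n1)

# Path graph 1-2-3-4 with a uniform node model.
probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
p13 = edge_probability(1, 3, probs, adj)
p14 = edge_probability(1, 4, probs, adj)
p24 = edge_probability(2, 4, probs, adj)
```

Only the number of nodes, not the number of node pairs, is enumerated for each ordered choice. In this small case the three possible new edges receive probabilities that sum to one.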

Remark 3. Separate inner models can be fitted to 
each type of operation for the outer model. Therefore, 
for example, the hypothesis that new nodes connect using
preferential attachment and inner edges connect using
PFP can be explored. The data set $C$ can be split
into two parts, those choices associated with connecting
to new nodes and those choices associated with adding
edges between existing nodes. In this case, the deviance
of the full inner model is the sum of the deviance of the
model components, and the per choice likelihood ratio $c_0$
can be calculated accordingly.

2.2 Model fitting using FETA 

The deviance and per choice likelihood ratio can determine
which inner model is a better fit for a given data
set. However, for parametrised models, they do not allow
the automatic tuning of parameters. In this section
a method is introduced based upon the statistical
technique of Generalised Linear Models (GLM) which
allows certain (linear) parameters to be automatically
tuned for an inner model. Again, for simplicity of discussion,
this section considers only inner models which
connect nodes to a new node.

Consider an inner node model $\theta$. It may be that the
ideal model is not pure preferential attachment or PFP,
but some mixture of these models. Further, the
probabilities may be affected by other factors (both
inherent in the graph topology and exogenous to the
topology but available as a data input).

Let $d_j(i)$ be the degree of node $i$ in graph $G_{j-1}$ (the
graph used to make choice $j$). Let $p_j(i|\theta)$ be the
probability that model $\theta$ assigns to node $i$ for choice $j$. For
the null model we have that

$$p_j(i|\theta_0) = C_j^0,$$

where $C_j^0$ is a normalising constant for a given choice
(that is, it is constant for a given $j$) so that the probabilities
sum to one over all $i$. Similarly, for the preferential
attachment model, referred to for now as $\theta_d$, we
have that

$$p_j(i|\theta_d) = C_j^d d_j(i),$$

where, again, $C_j^d$ is a normalising constant for a given $j$.
Let $t_j(i)$ be the number of triangles (3-cycles) node $i$
is part of in graph $G_{j-1}$. Now we can consider some
hypothetical model $\theta_t$ where connection probabilities
depend upon the triangles,

$$p_j(i|\theta_t) = C_j^t t_j(i),$$

where, again, $C_j^t$ is a constant for fixed $j$. A model can
be considered which is a linear combination of $\theta_0$, $\theta_d$
and $\theta_t$. Call this hypothesised model $\theta$. For this model
we would have that

$$p_j(i|\theta) = \beta_0 p_j(i|\theta_0) + \beta_d p_j(i|\theta_d) + \beta_t p_j(i|\theta_t), \qquad (1)$$

where $\beta_0$, $\beta_d$ and $\beta_t$ are all in the range $[0, 1]$ and sum
to one. These constants are the proportion of each of
the models which contribute to the final model. We use
GLM to find the optimal combination of $\beta$ parameters
for a given data set. A brief summary of GLM follows.

Let $y = \{y_1, y_2, \ldots, y_N\}$ be some set of observed data
we desire to model. Let $x^1 = \{x^1_1, x^1_2, \ldots, x^1_N\}$ and
$x^2, x^3, \ldots$ (similarly defined) be sets of observed data
that are to be used to explain $y$. A relationship is
hypothesised which allows $y$ to be estimated in terms of
$x^1$, $x^2$ and so on. A model is to be fitted of the form

$$y = \beta_0 + \beta_1 x^1 + \beta_2 x^2 + \beta_3 x^3 + \varepsilon, \qquad (2)$$

where the $\beta_i$ are parameters (not constrained to a range
this time) which give the contribution of the various
components to the variable $y$, $\beta_0$ is an intercept
parameter and $\varepsilon$ is an error component. Fitting
GLM can be done automatically using a statistical language
such as R (http://www.r-project.org). Given observed
data, this can be read into R and a GLM fitting procedure
can be used to find those $\beta$ values which maximise the
model likelihood. In addition, the fitting procedure produces
the model deviance and estimates of the errors and statistical
significance for each of the model parameters. If a parameter
is not statistically significant it should usually be
removed from the model.

Let $P_j(i)$ be an indicator variable which is 1 if and
only if node $i$ was actually the node picked for choice $j$,
and 0 otherwise. The problem of finding the best model
in (1) becomes the problem of fitting the GLM,

$$P_j(i) = \beta_0 p_j(i|\theta_0) + \beta_d p_j(i|\theta_d) + \beta_t p_j(i|\theta_t) + \varepsilon,$$

to find the combined model $\theta$ that best predicts the
$P_j(i)$. Thus, GLM fitting can be used to find the choice
of $\beta_i$ which maximises the likelihood of this model. This
is equivalent to finding the $\beta_i$ which give the maximum
likelihood for $\theta$ since, for model $\theta$, the expectation
$E[P_j(i)] = p_j(i|\theta)$.
This will give the choice of $\beta_i$ which best combines the
model components into the unified model $\theta$. If the $\beta_i$
are in the range $[0, 1]$ and sum to one, it can be trivially
shown that $\theta$ is a valid probability model as long as
$p_j(i|\theta_1), p_j(i|\theta_2), \ldots$ are.
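
The validity condition can be illustrated with a short sketch that forms the linear combination for a single choice and checks that the result still sums to one. The function name, the small graph and the weights are hypothetical, chosen only for illustration.

```python
def mix_models(component_probs, betas):
    """Linear combination of component models for one choice.

    component_probs: list of dicts node -> probability, each summing to one.
    betas: one weight per component, each in [0, 1], summing to one.
    """
    if any(b < 0 or b > 1 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to one")
    nodes = component_probs[0].keys()
    return {v: sum(b * p[v] for b, p in zip(betas, component_probs))
            for v in nodes}

# Hypothetical graph: triangle on nodes 1, 2, 3 plus node 4 attached to node 3.
degrees = {1: 2, 2: 2, 3: 3, 4: 1}
triangles = {1: 1, 2: 1, 3: 1, 4: 0}

p_null = {v: 1 / len(degrees) for v in degrees}
p_deg = {v: d / sum(degrees.values()) for v, d in degrees.items()}
p_tri = {v: t / sum(triangles.values()) for v, t in triangles.items()}

# Hypothetical weights for the null, degree and triangle components.
p_mix = mix_models([p_null, p_deg, p_tri], [0.2, 0.5, 0.3])
```

Because each component sums to one over the nodes and the weights sum to one, the mixture does too; node 3, which has the highest degree, receives probability 0.3375 under these weights.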

So, for the period between $G_0$ and $G_t$, for each choice
and each node, a data point is generated with the
parameters of the graph relevant to the models, and with
a 1 or a 0 depending on whether that node was the node
actually selected as an outcome of that choice. The
procedure has been tested on data from artificially
generated networks and it has been found to be able to
successfully recover their $\beta_i$
parameters in a wide variety of circumstances. Certain
model component combinations might be problematic
to fit, however. An example of this would be a model
constructed from a PFP and a preferential attachment
component: since these explanatory variables are very
similar for most nodes, finding a satisfactory mix using
GLM is usually extremely hard.
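
The data points described above might be generated along the following lines. This is a sketch under stated assumptions rather than the analysis tool used in this work: the function name and row layout are hypothetical, and the triangle component falls back to the null model when no node is part of a triangle.

```python
def glm_rows(degrees, triangles, chosen):
    """One data point per candidate node for a single choice.

    Each row is (p_null, p_degree, p_triangle, indicator), where the
    indicator is 1 for the node actually chosen and 0 otherwise.
    """
    n = len(degrees)
    deg_total = sum(degrees.values())
    tri_total = sum(triangles.values())
    rows = []
    for v in degrees:
        p0 = 1.0 / n
        pd = degrees[v] / deg_total
        # Assumed fallback: use the null model if there are no triangles yet.
        pt = triangles[v] / tri_total if tri_total > 0 else p0
        rows.append((p0, pd, pt, 1 if v == chosen else 0))
    return rows

# Hypothetical state of the graph before one choice, in which node 3 is picked.
degrees = {1: 2, 2: 2, 3: 3, 4: 1}
triangles = {1: 1, 2: 1, 3: 1, 4: 0}
rows = glm_rows(degrees, triangles, chosen=3)
```

Rows from all choices would then be concatenated and passed to a GLM fitting routine, with the first three columns as explanatory variables and the indicator as the response.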

Note that only the $\beta_i$ parameters can be automatically
optimised by the GLM fitting procedure. Any
other parameters, such as the $\delta$ in the PFP model, must
be fitted by other means, such as trying a number of
parameter choices and comparing the deviance or per
choice likelihood ratio.
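
A parameter sweep of this kind can be sketched as follows. The PFP weight used here, degree raised to the power $1 + \delta \log_{10}(\text{degree})$, is the form given by Zhou and Mondragón and is stated as an assumption of the sketch; the function names, the toy data and the candidate values are hypothetical.

```python
import math

def pfp_probs(degrees, delta):
    """Assumed PFP form: weight d ** (1 + delta * log10(d)), then normalise."""
    weights = {v: d ** (1.0 + delta * math.log10(d)) for v, d in degrees.items()}
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}

def log_likelihood(initial_degrees, choices, delta):
    """Replay the choices; each step joins a new node to the chosen node."""
    degrees = dict(initial_degrees)
    next_node = max(degrees) + 1
    ll = 0.0
    for chosen in choices:
        ll += math.log(pfp_probs(degrees, delta)[chosen])
        degrees[chosen] += 1
        degrees[next_node] = 1
        next_node += 1
    return ll

def best_delta(initial_degrees, choices, candidates):
    """Pick the candidate with the highest log likelihood (lowest deviance)."""
    return max(candidates,
               key=lambda d: log_likelihood(initial_degrees, choices, d))

# Toy data: the highest-degree node is chosen every time.
start = {1: 1, 2: 2, 3: 1}
best = best_delta(start, [2, 2, 2], [0.0, 0.05, 0.1])
```

With $\delta = 0$ the assumed form reduces to pure preferential attachment. In the toy data the hub is always chosen, so the largest candidate value gives the highest likelihood.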

Remark 4. As pointed out in remark 3, separate models
can be fitted for nodes connecting to new nodes and
for inner edges. The items of data are separated
by an analysis tool and they are fitted in different
GLM models. As in remark 2, fitting inner edges causes
issues for the framework. The choice of inner edges is
broken down into the choice of two nodes. The choice
set for the second node is constrained by removing from
the choice set those nodes which already have a link from
the first node.

2.3 FETA in practice 



The FETA evaluation process therefore consists of
hypothesising inner models which might fit the evolution
of a target network and calculating their likelihood
statistics as shown in section 2.1. The fitting procedure
in section 2.2 is used as an exploratory tool both to tune
linear combinations of model components and also to
provide hints as to which other components might be
introduced. For example, a negative $\beta$ parameter will
rarely produce a usable model, but if the
$\theta_t$ component produced a negative $\beta_t$ this would suggest
that the choice mechanism is avoiding nodes with a high
triangle count.

The graph in figure 2 shows the run time for measuring
model likelihood (as described in section 2.1) and
for creating a network file with a given number of links
(using a test model which is part PFP and part random).
The tests were run on a 2.66GHz quad core
Xeon CPU. The plot uses a log scale to show how run
time varies with network size. For 100,000 links the network
creation process takes 2,631 seconds and the
likelihood estimation process takes 53 seconds. Both
processes appear to scale approximately as the square 
of the number of links (for times under 1 second the tim- 
ing information is not accurate). Neither process takes 
a significant amount of memory. The relative speed of 
the evaluation of likelihood statistics is another benefit 
of the FETA approach. To tune the parameters of a 
hypothetical parametrised model using the "basket of 
statistics" approach, a new network would have to be 
grown for every test model. This is much more com- 
putationally intensive than the calculation of likelihood 
statistics required by FETA. 
Figure 2: Run time for network creation and
analysis processes in FETA.



3. FITTING MODELS TO NETWORK DATA 

The FETA procedure is used to create inner models
for several networks of interest. Section 3.1 fits models
to a co-authorship network inferred from the arXiv
database. Section 3.2 fits models to a view of the AS
network topology referred to here as the UCLA AS network
and section 3.3 fits models to a second view of the
AS topology, which we refer to here as the RouteViews
AS network.

Finally, section 3.4 fits network evolution models to
a network derived from user browsing behaviour, and
section 3.5 fits models to a social network derived from
the popular photo sharing site Flickr.

Table 1 summarises the networks considered in terms
of total edges, total nodes and the edge/node ratio.

Network          edges    nodes   edge/node
arXiv           15,788    9,121     1.73
UCLA AS         93,957   29,032     3.24
RouteViews AS   94,993   33,804     2.81
gallery         50,472   26,958     1.87
Flickr          98,931   46,557     2.13

Table 1: Sizes of the networks analysed



Several model components were considered in a linear
combination as described in section 2.2. Those components
are listed below, where $p_i$ is the probability of
choosing node $i$ and $k_x$ is a normalising constant for
component $x$ such that $\sum_i p_i = 1$ when summed over the
choice set. Furthermore, $d_i$ is the degree of node $i$ and
$t_i$ is the triangle count of node $i$.

• $\theta_0$ - the null model assumes all nodes have equal
probability $p_i = k_0$.

• $\theta_d$ - the degree based preferential attachment model
assumes node probability $p_i = k_d d_i$.

• $\theta_t$ - the triangle count model assumes node
probability $p_i = k_t t_i$.

• $\theta_1$ - the singleton model assumes node probability
$p_i = k_1$ if $d_i = 1$ and $p_i = 0$ otherwise.

• $\theta_2$ - the doubleton model assumes node probability
$p_i = k_2$ if $d_i = 2$ and $p_i = 0$ otherwise.

• $\theta_p^{(\delta)}$ - the PFP model assumes node probability
$p_i = k_p d_i^{1 + \delta \log_{10}(d_i)}$.

Note that the PFP model is the only one to require a
parameter.

This notation allows the concise description of a linear
additive model in terms of its components. For example,
$\theta = 0.1\theta_1 + 0.9\theta_p^{(0.04)}$ is a model which has
a component from the singleton model (contributing
0.1 of the probability) and a component from the PFP
model with parameter $\delta = 0.04$ (contributing 0.9 of the
probability). In the models $\theta_1$, $\theta_2$ and $\theta_t$ there is a
possibility of all nodes being assigned zero probability (if
there are no singletons, doubletons or triangles respectively).
In this case, $\theta_0$ is substituted for that model
component. This happens on extremely few occasions
and always very early in network construction. Obviously
a large collection of model components could be
tried but a conscious decision was taken to limit the
number of possibilities for this experimentation.
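
The substitution of the null model for an empty component can be sketched for the singleton and doubleton models as follows. The function name and the example degrees are hypothetical.

```python
def degree_class_probs(degrees, target_degree):
    """Singleton (target_degree=1) or doubleton (target_degree=2) component.

    Nodes with the target degree share the probability equally and all
    other nodes get zero. If no node has the target degree, the null
    model (equal probability for every node) is substituted.
    """
    matches = [v for v, d in degrees.items() if d == target_degree]
    if not matches:
        return {v: 1.0 / len(degrees) for v in degrees}
    share = 1.0 / len(matches)
    return {v: (share if d == target_degree else 0.0)
            for v, d in degrees.items()}

# Hypothetical star graph: node 2 is the hub, nodes 1, 3 and 4 are singletons.
degrees = {1: 1, 2: 3, 3: 1, 4: 1}
singleton = degree_class_probs(degrees, 1)   # 1/3 each for nodes 1, 3 and 4
doubleton = degree_class_probs(degrees, 2)   # no doubletons: null model
```

Either way the component remains a valid probability distribution over the choice set, which is what allows it to be used in a linear combination.
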
For each data set, three inner models are tried: 

1. a pure preferential attachment model, 

2. a pure PFP model (with an optimally tuned $\delta$ for
connections to new nodes, and another one for internal
edges),

3. the best model found using the techniques from
section 2.2.

Model 1 was picked because preferential attachment
is a reasonable baseline for improvement. Model
2 was picked because investigation showed that for almost
every network the PFP model had low deviance.
Model 3 was picked to show the improvement (if any)
possible by using linear combinations of models.

The outer model was derived simply by calculating
two distributions empirically from the network data:

1. the number of inner nodes each new node connects 
to on arrival, 

2. the number of inner edges connected between each 
new node arrival. 

These distributions are then used to create the outer 
model. This is simplistic and obviously further research 
is required to improve the techniques to generate this 
outer model. 
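A minimal sketch of how these two empirical distributions might be tabulated is given below. The event format (a list of `("node", k)` and `("inner", 1)` tuples in arrival order) and the function names are assumptions made for illustration; inner edges before the first arrival or after the last one are simply ignored here.

```python
import random
from collections import Counter


def _normalise(counter):
    """Turn raw counts into a probability for each observed value."""
    total = sum(counter.values())
    return {value: n / total for value, n in sorted(counter.items())}


def fit_outer_model(events):
    """Return (links per new node, inner edges between node arrivals)."""
    links_on_arrival = Counter()
    inner_between = Counter()
    pending_inner = 0
    seen_node = False
    for kind, count in events:
        if kind == "node":
            links_on_arrival[count] += 1
            if seen_node:  # only count gaps between two arrivals
                inner_between[pending_inner] += 1
            seen_node = True
            pending_inner = 0
        else:  # "inner": edges between existing nodes
            pending_inner += count
    return _normalise(links_on_arrival), _normalise(inner_between)


def sample(distribution, rng=random):
    """Draw one value from an empirical distribution."""
    values = list(distribution)
    weights = list(distribution.values())
    return rng.choices(values, weights=weights, k=1)[0]


events = [("node", 1), ("inner", 1), ("inner", 1), ("node", 2), ("node", 1)]
links, gaps = fit_outer_model(events)
print(links, gaps)
```

Sampling from the two returned dictionaries then drives the outer model when a network is grown.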

Results are presented using the metrics from section
2.1: $D$ is the deviance, $D_0$ is the null deviance and $c_0$ is
the per choice likelihood ratio. A better model is
indicated by lower $D$ and $D_0$, and by a higher $c_0$. The results
are broken down into the contribution from the inner
model for connecting to new nodes and the inner model for
connecting internal edges.
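As a rough illustration of these three quantities, the sketch below computes them from the probabilities that a model (and the null model) assigned to each choice actually observed. The formulas are assumptions based on the standard definitions (deviance as minus twice the log-likelihood, null deviance as minus twice the log-likelihood difference from the null model, and $c_0$ as the geometric mean per-choice likelihood ratio); they match the sign behaviour described in the text, where a model worse than the null model has positive $D_0$ and $c_0 < 1$.

```python
import math


def log_likelihood(choice_probs):
    """Sum of log-probabilities the model gave to the observed choices."""
    return sum(math.log(p) for p in choice_probs)


def deviance(model_probs):
    """D: minus twice the log-likelihood (assumed definition)."""
    return -2.0 * log_likelihood(model_probs)


def null_deviance(model_probs, null_probs):
    """D_0: negative when the model beats the null model."""
    return -2.0 * (log_likelihood(model_probs) - log_likelihood(null_probs))


def per_choice_likelihood_ratio(model_probs, null_probs):
    """c_0: geometric mean of the likelihood ratio over all choices."""
    diff = log_likelihood(model_probs) - log_likelihood(null_probs)
    return math.exp(diff / len(model_probs))


# Two observed choices among four nodes; the null model gives 1/4 each.
model_probs = [0.5, 0.25]
null_probs = [0.25, 0.25]
print(deviance(model_probs),
      null_deviance(model_probs, null_probs),
      per_choice_likelihood_ratio(model_probs, null_probs))
```

Under these definitions $c_0 = \exp(-D_0 / 2N)$ for $N$ choices, so the two statistics always rank models for the same data set in the same order.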

3.1 Fitting the arXiv data set 

A publication co-authorship network was obtained 
from the online academic publication network arXiv. 
The first paper was added in April 1989 and papers are 
still being added to this day. To keep the size manage- 
able, the network was produced just from the papers 
categorised as math. The network is a co-authorship 
network: an edge is added when two authors first write 
a paper together. In this case, because it is required 
that the network remains connected, edges which are 
not connected to the largest connected component are 
ignored. Multiple edges between two authors are not 
added. The processing of this network is far from perfect:
only author names (rather than unique IDs) are
matched. Inconsistent naming conventions mean some
authors are recorded by first name and surname, and
some by initial and surname. To avoid problems matching
John Smith, J. Smith and John W. Smith, the
match is on first initial and surname, though it is clear
this will allow some collisions. One paper was removed
from analysis. The paper has sixty authors, far more
than the paper with the next largest number of authors.
Since each author on a paper forms a graph clique with
all the other coauthors on that same paper, this paper
added 1,732 links for which no arrival order significant
to the evolution of the network could be found. As a
size 60 clique would distort most network statistics, it
was rejected as an outlier.
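The matching rule described above (first initial plus surname, one edge per pair of coauthors, no multiple edges) might be sketched as follows. The function names and the assumption that names are written in "given names, then surname" order are illustrative only and are not taken from the original processing scripts.

```python
from itertools import combinations


def author_key(name):
    """Reduce a name to 'first initial + surname', lower-cased."""
    parts = name.replace(".", " ").split()
    if not parts:
        raise ValueError("empty author name")
    return (parts[0][0] + " " + parts[-1]).lower()


def coauthor_edges(papers):
    """Yield each coauthor pair once, in order of first joint paper.

    `papers` is a list of author-name lists in publication order.
    """
    seen = set()
    edges = []
    for authors in papers:
        keys = sorted({author_key(a) for a in authors})
        for pair in combinations(keys, 2):  # every paper forms a clique
            if pair not in seen:            # no multiple edges
                seen.add(pair)
                edges.append(pair)
    return edges


papers = [["John Smith", "Ann Lee"], ["J. Smith", "A. Lee", "Bo Chen"]]
print(coauthor_edges(papers))
```

Note that this rule deliberately merges "John Smith" and "Jane Smith", which is the kind of collision the text acknowledges.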

As described in the previous section, three models
were tried: a preferential attachment model, a pure PFP
model and the best model found using the fitting procedure.
Model 2, the best PFP model, was $\theta_p^{(\ldots)}$ for connections
to new nodes and $\theta_p^{(-0.02)}$ for inner edge connections.
Model 3, the best model found, was
$0.881\theta_p^{(\ldots)} + 0.119\theta_1$ for new node connections and,
for internal node connections, the pure PFP model as in
model 2 (no better model could be found).

As can be seen, the best model by all measures is
model 3. It is worth noticing that the inner edge model
does not perform significantly differently between the
three (in any case this is the same model for 2 and 3).
With such a small $\delta$ parameter the inner edge model is
almost the same as preferential attachment (model 1). It should
also be noticed that the deviance itself is hard to compare,
simply because it is such a large number that the
relative differences seem small. For the preferential
attachment model (model 1), the new node model was
actually worse than the null model, as can be seen
from the fact that $D_0$ is positive and $c_0$ is less than one.
Overall, the new node model appears to have made few
gains relative to the inner edge model in all cases ($c_0$
is larger for the inner edge models than the new node
models) despite the simplicity of the inner edge model.
This suggests that improving the new node model is the
best focus for model improvements in general. The
improvement in new node model $c_0$ from 1.06 in model 2 to
1.09 in model 3 seems significant and indicates that the
addition of a model reflecting singletons is useful.
Remembering that singletons in this case are authors with
only a single other co-author, perhaps this is explained
by a desire for those authors with a single co-author to
collaborate with other new authors.

3.2 Fitting the UCLA AS data set 

The data set we refer to here as the UCLA AS data
set is a view of the Internet AS topology seen between
January 2004 and August 2008. It comes from the
Internet topology collection maintained by Oliveira et
al. [TS]. These topologies are updated daily using
data sources such as BGP routing tables and updates
from RouteViews, RIPE, Abilene and LookingGlass
servers. Each node and link is annotated with the times
it was first and last observed during the measurement
period.

Model     Component    $D$        $D_0$     $c_0$
Model 2   Inner edge   118,000    -4,170    1.32
Model 2   Overall      311,000    -6,610    1.18
Model 3   New node     193,000    -3,090    1.13
Model 3   Inner edge   118,000    -4,240    1.32
Model 3   Overall      311,000    -7,340    1.21

Table 2: Three models tested on the arXiv network.

As previously stated, our network growth model does 
not include a removal process. On the other hand, var- 
ious links and nodes disappear from the UCLA data 
set during the time interval under analysis. To incor- 
porate this into our modelling framework, the data is 
preprocessed by removing all edges and nodes which 
are not seen in the final sixty days of the data, so that 
the final state of the evolution of the network is the AS 
network as it is in August 2008. Edges are introduced 
into the network in the order of their first sighting. If 
this would cause the network to become disconnected, 
their introduction is delayed until the arrival of other 
links and nodes allows them to join while maintaining 
a connected network at all times. 
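The delayed-introduction rule in the paragraph above can be sketched as a single pass over the edges in order of first sighting. This is a hypothetical illustration, not the preprocessing code used for the paper: the function name, the `(u, v)` edge tuples and the optional `seed_nodes` argument are all assumptions.

```python
def connected_arrival_order(edges, seed_nodes=()):
    """Reorder edges so the growing network is connected at all times.

    `edges` are (u, v) pairs sorted by first sighting. An edge that would
    start a disconnected island is held back until one of its end points
    has joined the network. Returns (ordered edges, edges never attached).
    """
    connected = set(seed_nodes)
    ordered, pending = [], []

    def attaches(edge):
        u, v = edge
        return not connected or u in connected or v in connected

    for edge in edges:
        if not attaches(edge):
            pending.append(edge)  # delay: it would disconnect the network
            continue
        ordered.append(edge)
        connected.update(edge)
        # A new arrival may allow delayed edges to join; rescan until stable.
        progress = True
        while progress:
            progress = False
            still_waiting = []
            for waiting in pending:
                if attaches(waiting):
                    ordered.append(waiting)
                    connected.update(waiting)
                    progress = True
                else:
                    still_waiting.append(waiting)
            pending = still_waiting
    return ordered, pending


print(connected_arrival_order([(1, 2), (3, 4), (2, 3), (5, 6)]))
```

In this example the edge (3, 4) is delayed until (2, 3) arrives, while (5, 6) never attaches and is returned separately.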

The arrival order of edges is only known after timed
link arrival data becomes available in January 2004.
Furthermore, there is a period of fast discovery of nodes and
edges immediately after this time where the order of
edge arrival is considered to be very uncertain (since
snapshots are only daily and not every link will be
discovered on the first day it exists). Thus, the first days
of data are considered a "warm up" period and removed
from the analysis. $G_0$ is taken to be the state of the
network after this warm up period expires. The growth of
the network is shown in figure 3.

Exploratory model fitting on the UCLA AS network
showed that a PFP model was again favoured. Again
the inner edge model seemed to have a smaller $\delta$ than
the new node model. No model was found to be a great
improvement over PFP but there was some evidence
that including a small singleton component would make
slight improvements.

Figure 3: Network growth for UCLA AS network.

Footnotes:
Paper removed from the arXiv analysis: http://arxiv.org/abs/math/0406190
UCLA Internet topology collection: http://irl.cs.ucla.edu/topology/
RIPE: http://www.ripe.net/db/irr.html/
Abilene: http://abilene.internet2.edu/

Three models are tested on the UCLA AS network.
Model 1 is pure preferential attachment. Model 2 is
pure PFP with different $\delta$ parameters: $\theta_p^{(\ldots)}$ for
the new node model and $\theta_p^{(\ldots)}$ for the inner edge
model. The best model found was only slightly better
than this, and it combined PFP with a tiny
amount of the singleton model. The new node model
was $0.974\theta_p^{(\ldots)} + 0.026\theta_1$. The inner edge model was
$0.960\theta_p^{(\ldots)} + 0.040\theta_1$.

The results of this fitting exercise are shown in table
3. In this case, the improvement against preferential
attachment was extremely marginal. It was only model
3 that showed an improvement, and this improvement
was mostly in the inner edge model (indeed its new node
model was worse than that of model 2). Overall, the $c_0$
values were relatively high, indicating a good fit to the
data compared with the random model.



Model     Component    $D$          $D_0$       $c_0$
Model 1   New node     320,000      -102,000    10.6
Model 1   Inner edge   1,790,000    -402,000    5.74
Model 1   Overall      2,110,000    -504,000    6.33
Model 2   New node     319,000      -102,000    10.8
Model 2   Inner edge   1,790,000    -402,000    5.73
Model 2   Overall      2,110,000    -504,000    6.33
Model 3   New node     320,000      -102,000    10.7
Model 3   Inner edge   1,780,000    -405,000    5.82
Model 3   Overall      2,100,000    -507,000    6.41

Table 3: Three models tested on the UCLA AS
network.



3.3 Fitting the RouteViews AS data set

For the present paper we define the RouteViews AS
data set as the view of the Internet AS topology from
the point of view of a single RouteViews data collector.
The raw data used to construct it comes from the
University of Oregon Route Views Project [TJ], and it
was recovered by parsing the routing tables
obtained by running 'show ip bgp' on the command line
of route-views3.routeviews.org and capturing the output.
To construct the node and link arrival process to
which we fit our evolution models we process one such
table dump per day over the time interval between April
11th, 2007 and January 16th, 2009.

It is well known that an AS map obtained in such
a way will not be representative of the true AS Internet
topology (see [H], [TH], [SI], [13]). However, a validation
framework like FETA should be able to discover
this difference by fitting different growth models to the
RouteViews AS data set and the more complete UCLA
data set.

Since the basic outer models that we set out to eval- 
uate do not have a node removal process, we consider 
only the addition of AS numbers and peerings to the 
AS map, as it is viewed from the perspective of route- 
views3.routeviews.org. Thus, we seek to model the cu- 
mulative AS growth process as viewed from a single 
BGP peer. 

As with the UCLA data set, we ignore the very first 
tables processed, as their dynamics are not representa- 
tive of the system equilibrium growth rate, and their 
timing information is unavailable. The growth of the
network is shown in figure 4.

Figure 4: Network growth for RouteViews AS
network (horizontal axis: days since start of data collection).

As before, model fitting on the RouteViews AS data
set showed that a PFP model was favoured. As in the
previous cases, the inner edge model seemed to have a
significantly smaller $\delta$ than the new node model. As
with the UCLA data set, a small singleton component in
the inner edge model yields slight improvements.

Three models are tested on the RouteViews AS data
set. Model 1 is pure preferential attachment. Model 2
is pure PFP with different $\delta$ parameters ($\delta = 0.034$
for the new node model and $\delta = 0.003$ for the inner
edge model). Model 3 took the same new node model
as model 2, but for the internal edge model it combined
a pure PFP model with the singleton model
according to $0.87\theta_p^{(\ldots)} + 0.13\theta_1$.

The results are shown in table 4. As before, the
improvement against preferential attachment was
extremely marginal. It was only model 3 that showed
improvement, as a consequence of a slightly better
inner edge model. Again, $c_0$ values were relatively high,
indicating a good fit to the data compared with the
random model.



Model     Component    $D$          $D_0$       $c_0$
Model 1   New node     138,000      -45,400     12.7
Model 1   Inner edge   1,478,000    -257,000    4.36
Model 1   Overall      1,620,000    -302,400    4.81
Model 2   New node     138,000      -46,100     13.21
Model 2   Inner edge   1,480,000    -257,000    4.36
Model 2   Overall      1,620,000    -303,400    4.83
Model 3   New node     138,000      -46,100     13.21
Model 3   Inner edge   1,470,000    -264,000    4.53
Model 3   Overall      1,610,000    -310,100    5.00

Table 4: Three models tested on the RouteViews
AS network.



As expected, the model found for the RouteViews AS
data set is quite close to the one found for the UCLA
data set, while still being different enough to accommodate
their differing topological characteristics. Overall,
the difference in the PFP/singleton mix between
the best models fitted for the UCLA AS and RouteViews
data sets suggests that, from the point of view of
route-views3.routeviews.org, ASs with a single point of
attachment to the Internet are more prone to becoming
multihomed than they are from a more complete
perspective of the AS topology.

3.4 Fitting the gallery data set 

The website known simply as "gallery" is a photo
sharing website. To be able to upload pictures and have 
some control over the display of pictures, users have to 
create an account and login. From webserver logs, the 
path logged in users browse as they move across the 
network can be followed. Thus, images become nodes 
in the networks, and a user browsing between two pho- 
tos creates a link between the two nodes that represent 
them. These links are overlaid for all users in order to 
form our network. 

Model 1 is, as usual, a pure preferential attachment
model $\theta_d$. The fitting of model 2 was problematic for
this network, minimising deviance for the new nodes
model with the unusually low value of $\delta = -1.8$.
For inner edges, the PFP model $\theta_p^{(\ldots)}$ had lowest
deviance. Finally, model 3 has the new node model
$0.516\theta_d + 0.484\theta_1$, and the same inner edge model as
model 2.

Footnote: gallery website: http://gallery.future-i.com/

Table 5 shows the model likelihood statistics, where
the inadequacy of the proposed models for the network
growth dynamics is apparent. In particular, the model
to connect to new nodes was, for model 1 and model 2,
worse than the null model $\theta_0$ (which connects to nodes
at random). Thus, FETA allows us to discover in a
straightforward way that new node connections in this
network do not have a preferential attachment structure
at all. We hypothesise that the peculiar new node
arrival process arises from the fact that the browsing
network is, uniquely amongst the networks examined
here, a transient one in the sense that a link between
two nodes is made by a user moving from one picture
to the next - however, no permanent record of this is
reflected to the user, and thus, user behaviour is not
influenced by it.



Model     Component    $D$          $D_0$       $c_0$
Model 1   New node     675,000      44,300      0.523
Model 1   Inner edge   586,000      -17,000     1.23
Model 1   Overall      1,260,000    27,000      0.815
Model 2   New node     645,000      14,000      0.810
Model 2   Inner edge   586,000      -17,200     1.30
Model 2   Overall      1,230,000    -2,750      1.02
Model 3   New node     529,000      -102,000    4.43
Model 3   Inner edge   586,000      -17,000     1.30
Model 3   Overall      1,110,000    -119,000    2.43

Table 5: Three models tested on the gallery user
network.



3.5 Fitting the Flickr data set 

The Flickr website allows users to associate themselves
with other users by naming them as Contacts.
In [14] the authors describe how they collected data for
the graph made by users as they connect to other users.
The first 100,000 links of this network are analysed here.
The graph is generated by a web-crawling spider, so the
order of arrival of edges is the order in which the spider
moves between the users rather than the order in
which the users made the connections. Thus, the
evolution dynamics of this network will be determined by
the spidering code.

The analysis compares only two different network models:
first, a pure preferential attachment model $\theta_d$;
second, a PFP model with $\theta_p^{(\ldots)}$ for the new node
connections and $\theta_p^{(\ldots)}$ for the inner edge connections. No
combined model was found which improved over the
PFP model.

Footnote: Flickr website: http://flickr.com/

Model     Component    $D$          $D_0$         $c_0$
Model 1   New node     379,000      -529,000      294
Model 1   Inner edge   1,600,000    -479,000      9.83
Model 1   Overall      1,970,000    -1,010,000    27.9
Model 2   New node     352,000      -555,000      389
Model 2   Inner edge   1,590,000    -481,000      9.93
Model 2   Overall      1,945,733    -1,040,000    30.7

Table 6: Two models tested on the Flickr network.

3.6 Discussion of model fitting 

Several conclusions can be drawn from the model fit- 
ting process. For most models considered in this sec- 
tion, providing different model components for the in- 
ner new node model and the inner edge model yields 
improved models. Thus, the separation of the inner 
model into a sub-model for connections to new nodes 
and a sub-model for new internal edges between exist- 
ing nodes is usually productive. 

Moreover, in all but one case (the Flickr data set) it 
was found that inner models with a higher likelihood 
could be obtained from a linear combination of model 
components. Thus, the ability to produce optimised 
models through the linear combination of sub-models is 
of use in finding improved network evolution models. 

Models based upon PFP generally had high likelihoods
(but this was the only parametrised model component
tried, so this might be simply an issue of
increased flexibility in the fitting procedure). The model
parameters selected for the two different AS networks
were encouragingly similar, pointing to common network
evolution dynamics, but had significant differences
consistent with their differing measurement characteristics.

4. ARTIFICIAL TOPOLOGY GENERATION 

The models explored in the previous section have
been generated purely by fitting proposed models so
that their parameters best predicted the actual network
growth process observed. Thus, the models were
created without growing test networks, measuring
statistics on them and further refining them - indeed, the
models were identified without measuring any statistics
about the real network.

However, it is natural to expect that if the null
deviance $D_0$ and per choice likelihood ratio $c_0$ predict that
model $\theta_a$ is "better" than model $\theta_b$, this will be
reflected in model $\theta_a$ growing artificial networks with a
better match to the statistics of the real network than
model $\theta_b$. Here, therefore, artificial networks are
generated from the seed $G_0$ (in the case of the AS
networks this is the state of the network shortly after
measurements started, while in the three remaining cases
this is simply a single edge). Each of the models from
the previous section and the random model are used to
grow a network of the same size as the full real network,
and summary statistics are compared.
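The growth procedure can be illustrated with a simplified sketch. It is not the FETA generator: the function names and the operation format are hypothetical, the inner models are restricted to weights that depend only on node degree, and the rule for inner edges (choose one end point, then a second among non-neighbours) is an assumption.

```python
import random


def _choose(nodes, degree, weight, rng):
    """Pick one node with probability proportional to weight(degree)."""
    weights = [weight(degree[n]) for n in nodes]
    if sum(weights) <= 0:  # no eligible node: fall back to the null model
        weights = [1.0] * len(nodes)
    return rng.choices(nodes, weights=weights, k=1)[0]


def grow(seed_edges, operations, new_node_weight, inner_edge_weight, seed=0):
    """Grow a network from seed edges; returns an adjacency dict of sets.

    `operations` is the outer model's output: ("node", k) adds a node
    with k links, ("inner", k) adds k edges between existing nodes.
    """
    rng = random.Random(seed)
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for u, v in seed_edges:
        add_edge(u, v)
    next_id = max(adj) + 1  # assumption: nodes are labelled with integers

    for kind, count in operations:
        degree = {n: len(nbrs) for n, nbrs in adj.items()}
        if kind == "node":
            candidates = list(adj)
            targets = []
            for _ in range(min(count, len(candidates))):
                target = _choose(candidates, degree, new_node_weight, rng)
                targets.append(target)
                candidates.remove(target)  # distinct targets only
            for target in targets:
                add_edge(next_id, target)
            next_id += 1
        else:
            for _ in range(count):
                degree = {n: len(nbrs) for n, nbrs in adj.items()}
                first = _choose(list(adj), degree, inner_edge_weight, rng)
                candidates = [n for n in adj
                              if n != first and n not in adj[first]]
                if candidates:  # skip if `first` already links to everyone
                    second = _choose(candidates, degree,
                                     inner_edge_weight, rng)
                    add_edge(first, second)
    return adj


# Preferential attachment (weight = degree) for both inner models.
network = grow([(0, 1)], [("node", 1), ("node", 2), ("inner", 1)],
               lambda d: d, lambda d: d, seed=1)
print(network)
```

Passing a different weight function, such as a PFP weight, changes the inner model without touching the growth loop.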

The results in this section need careful interpretation.
In particular, it should be remembered that the
claim is not that these models are the best possible fit
to the real network - in some cases, the claim is that
the models tried are actually worse than simply selecting
nodes at random. The fitting procedure in section
2.2 optimises the mixture of model components (the $\beta$
parameters), while the evaluation procedure can optimise
other model parameters (such as the PFP $\delta$) using
any state space search technique. However, the models
themselves need to be provided as an input, and it may
be the case that no perfect model is to be found from
the model components chosen.

However, independently of the precise mathematical
description of the network growth models under test,
a model with $c_0 > 1$ should be better than a random
model, and the model with the highest $c_0$ should be
the best. Such a ranking is difficult to achieve using statistical
network measures: saying that one model reproduces
real network statistics "better" than another model is in
itself problematic. If a model scores well on three highly
related statistics but extremely badly on two others,
is it a good model? The "basket of statistics" does
not always give an unambiguous answer as to which
model is "best".

For this section, four statistics related to the degree
distribution are used: $d_{max}$, the maximum node degree
in the network; $d_1$, the proportion of nodes with degree
one; $d_2$, the proportion of nodes of degree two; and
$\overline{d^2}$, the mean square of the node degrees (the mean
degree $\overline{d}$ is a property of the
outer model and automatically equal to that of the real
network in all models here).

In addition, two further statistics are used, capturing
the interaction between pairs and triples of nodes.
The clustering coefficient of a node is the number of
3-cycles that the node belongs to, divided by the potential
number of 3-cycles between its neighbouring nodes
(nodes of degree one do not have any potential
triangles, so the clustering coefficient is not defined
for them). In the tables in this section $\gamma$ is the mean
clustering coefficient for the graph. The assortativity
coefficient $r$ is positive when nodes attach to nodes of
like degree (high degree nodes attach to each other) and
negative when high degree nodes tend to attach to low
degree nodes. For full definitions of all these quantities
see Ho]. 
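For concreteness, the six statistics could be computed from an adjacency dictionary as sketched below. This is an illustrative sketch under stated assumptions rather than the measurement code behind the tables: nodes are assumed to carry comparable (for example integer) labels, the mean clustering coefficient is averaged only over nodes of degree two or more, and $r$ is taken as the Pearson correlation of the degrees at the two ends of each edge, with every edge counted in both directions.

```python
def summary_statistics(adj):
    """Degree, clustering and assortativity statistics of a simple graph.

    `adj` maps each node to the set of its neighbours.
    """
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    n = len(adj)
    stats = {
        "d_max": max(degree.values()),
        "d1": sum(1 for d in degree.values() if d == 1) / n,
        "d2": sum(1 for d in degree.values() if d == 2) / n,
        "mean_sq_degree": sum(d * d for d in degree.values()) / n,
    }

    # Clustering: closed neighbour pairs over possible neighbour pairs.
    coefficients = []
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # undefined for nodes with fewer than two neighbours
        closed = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        coefficients.append(2.0 * closed / (k * (k - 1)))
    stats["gamma"] = (sum(coefficients) / len(coefficients)
                      if coefficients else 0.0)

    # Assortativity: correlation of end-point degrees over directed edges.
    xs, ys = [], []
    for u, nbrs in adj.items():
        for v in nbrs:
            xs.append(degree[u])
            ys.append(degree[v])
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    stats["r"] = cov / var if var > 0 else float("nan")
    return stats


# A triangle (0, 1, 2) with a pendant node 3 attached to node 2.
print(summary_statistics({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}))
```

For a regular graph the degree variance is zero, so the sketch reports `nan` for $r$ rather than dividing by zero.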

4.1 Topology generation using FETA models 

4.1.1 Statistics on the arXiv evolution model 

The summary statistics for the arXiv publication network
are in table 7. The previous modelling in table 2
rated model 1 with $c_0 = 1.15$, model 2 with $c_0 = 1.18$
and model 3 with $c_0 = 1.21$. If these figures are
reliable, model 3 should be expected to be a better fit
than model 2, which is in turn a better fit than model 1.
This is certainly borne out for the degree distribution
statistics, with model 3 closest for $d = 1$, $d = 2$ and $\overline{d^2}$
and marginally worse than model 2 for $d_{max}$. All three
models generate networks which replicate $\gamma$ badly, with
model 1 being the closest. With respect to assortativity,
model 1 is closest in absolute terms but it predicts
a disassortative network when the actual network is
assortative.

Model   d=1     d=2     $\overline{d^2}$   $d_{max}$   $r$       $\gamma$
Real    0.314   0.237   31.3               127         0.00557   0.145
Rand    0.233   0.223   23.9               24          0.245     0.00285
1       0.483   0.215   123.5              446         -0.060    0.0154
2       0.431   0.204   39.1               97          0.152     0.00901
3       0.348   0.258   33.6               75          0.179     0.00748

Table 7: Summary statistics, arXiv co-authorship
network.

While these results are not straightforward to interpret,
the overall picture seems to confirm that model
3 reproduces the statistics of the network better than
model 2, which in turn beats model 1. The relatively
low $c_0$ value means that the models should not be a
dramatic improvement upon the random baseline, and
this is borne out by the statistics (indeed, for model
1 it is arguable whether the model is even better than
random).

4.1.2 Statistics on the UCLA AS evolution model 



Model   d=1     d=2     $\overline{d^2}$   $d_{max}$   $r$        $\gamma$
Real    0.122   0.245   6,620              3,150       -0.197     0.0584
Rand    0.129   0.118   210                1,199       -0.00962   0.0173
1       0.447   0.163   2,230              4,177       -0.144     0.0190
2       0.451   0.167   3,230              5,305       -0.168     0.0148
3       0.363   0.215   3,820              6,109       -0.172     0.0121

Table 8: Summary statistics, UCLA AS network.



As detailed in section 3.2, evolution information was
not known for the early part of the UCLA AS network
growth. Therefore, the first 42,000 edges were
taken from the original network, and its evolution
followed from this. The statistics from table 3 gave $c_0 =
6.33$ for model 1 and model 2, but $c_0 = 6.41$ for model
3. This implies that model 3 should be a modest
improvement on models 1 and 2. This is borne out by the
statistics in table 8 for $d = 1$, $d = 2$ and $\overline{d^2}$, but for $d_{max}$
model 3 performs the worst and is incorrect by some
way. With $r$, model 3 is again the best and relatively
close to the correct value. Regarding the clustering
coefficient $\gamma$, model 3 is best but all models are quite far
away from the correct value. As predicted, model 1 and
model 2 are hard to distinguish using these statistics.
Overall, model 3 was best or close to best in almost all
statistics measured, as the $c_0$ value predicts. All models
would be expected to be a good improvement on the
random model and this is shown in all statistics except
$d = 1$, which random gets nearly exactly.

4.1.3 Statistics on the RouteViews AS evolution model 



Model   d=1     d=2     $\overline{d^2}$   $d_{max}$   $r$       $\gamma$
Real    0.203   0.363   2,110              3,294       -0.186    0.00887
Rand    0.093   0.118   630                2,289       -0.0710   0.00266
1       0.342   0.185   2,130              4,172       -0.154    0.00631
2       0.350   0.187   2,520              4,637       -0.165    0.00590
3       0.118   0.358   2,610              4,844       -0.163    0.00443

Table 9: Summary statistics, RouteViews AS
network.



From table 4 it would be expected that model 1
($c_0 = 4.81$) would be the same as or very slightly worse
than model 2 ($c_0 = 4.83$) and model 3 ($c_0 = 5.00$) would
be slightly better than either. It is worth noticing that
the ratio of these figures is small and the expected
improvement from 1 to 3 is slight. This hierarchy is borne
out by the statistics for nodes of degree 1 and degree 2
with model 3 being considerably better in both cases.
For $\overline{d^2}$ and $d_{max}$, however, the expectation is reversed
and for these statistics, model 1 is better. Models 2
and 3 are close to each other and the correct value for
$r$ but for $\gamma$ model 1 is better than either. In the end it
is hard to say from these statistics which model is the
best. The high values of $c_0$ do unambiguously claim
that all models are superior to random by some way
and this is certainly the case. The random model is the
worst model for all statistics except for $d_{max}$.

4.1.4 Statistics on the gallery evolution model 



Model   d=1      d=2     $\overline{d^2}$   $d_{max}$   $r$      $\gamma$
Real    0.0132   0.473   26.3               214         0.144    0.0829
Rand    0.217    0.117   210                30          0.283    0.000809
1       0.447    0.235   369                1,442       -0.065   0.00689
2       0.279    0.205   38.0               277         0.160    0.00992
3       0.0924   0.453   51.1               354         0.0708   0.00537

Table 10: Summary statistics, gallery user browsing
network.



The gallery likelihood table (table 5) shows that for model 1,
$c_0 = 0.815$ (worse than random), for model 2 $c_0 = 1.02$
and for model 3 $c_0 = 2.43$. This means that, in the
statistics in table 10, model 3 should outperform model
2, which itself should outperform model 1. This
expectation is largely borne out by the degree statistics,
with model 1 very inaccurate for all statistics based on
node degree. However, in this case, it is hard to see
the very clear distinction between model 2 and model 3
which is predicted by the $c_0$ values. Model 3 is certainly
better at predicting the number of nodes of degree one
and two and does quite well with $\overline{d^2}$ and $d_{max}$.
However, it remains hard to claim that model 3 represents
the clear improvement in model accuracy that the $c_0$
statistic would lead us to expect.

We have a case where model 1 is expected to be worse
than random and model 2 not much better. This
certainly seems to match the statistics provided. The
relatively poor performance of model 3 remains an anomaly
of this data set.

4.1.5 Statistics on the Flickr evolution model 



Model   d=1     d=2     $\overline{d^2}$   $d_{max}$   $r$      $\gamma$
Real    0.639   0.157   7,500              11,053      -0.288   0.00196
Rand    0.245   0.179   32.4               35          0.341    0.000758
1       0.560   0.172   694                1,704       -0.119   0.0216
2       0.572   0.168   1,290              3,587       -0.154   0.0107

Table 11: Summary statistics, Flickr spider network.



As detailed in section 3.5, only two models were tried
for the Flickr data set. Table 6 gives extremely high
$c_0$ values for both models, with model 2 being slightly
better than model 1. This is definitely reflected in the
statistics in table 11, with model 2 being closer to the
real data on all statistics. The models are quite close on
many statistics, but fail to predict the extremely highly
connected node with degree 11,053. This may be an
artifact of the browsing pattern of the spider, which
may also be reflected in the $\overline{d^2}$ value being incorrect.
The high $c_0$ values indicate that both models should
be considerably better than the random model, which
is certainly true - the random model is extremely bad.
These results point towards considerable structure to
the network evolution which the random model
fails to capture.

4.2 Discussion on topology generation 

None of the models tested here were perfect at
reproducing the selected statistics of their respective
network data sets. In the majority of cases, the best fitting
models reproduced the degree distribution related
metrics measured here, but matching the maximum degree
was often difficult. However, one thing the modelling
in this section certainly shows is the difficulty of
distinguishing between models by considering a large number
of, often correlated, statistics.

Obviously the models tested here could be improved.
However, the network statistics measured did rank the
models in the same order as the likelihood statistics from
section 3 (the exception being model 3 in the gallery data
which, while arguably the best model, was not better
by the expected degree). This is an important
confirmation of the usefulness of the likelihood statistic $c_0$
in assessing the fit of network evolution models. The
gallery data definitely proved an exception to expectations,
and this is perhaps due to the transient nature of
this network as discussed in section 3.4.

A general conclusion of this section on the models
themselves was that (apart from the gallery data), as
might be expected, the PFP based models outperformed
the degree based model, and the "tweaked" models from
the fitting process in section 2.2 (where better models
were found) did better still in most cases.

For a given network and a given model the value of
$c_0$ did seem (with one exception) to be an accurate
predictor of how well a model would replicate the statistics
of a target network. However, it is hard to see a
connection between the magnitude of $c_0$ between networks
and the success in prediction. For example, the
predictions on the arXiv network seemed very good for model
3 for most statistics despite the model only having a $c_0$
value of 1.21.

In general, though, the relative likelihood statistics for
the different models were reflected in their performance
at reproducing representative network statistics. Those
models with higher likelihoods (lower deviances) better
reproduced the statistics of the target network. This
confirms the usefulness of the framework for automatic
model selection.

5. CONCLUSIONS 

In this paper we present FETA, the Framework for 
Evolving Topology Analysis. The most important con- 
tribution of FETA is a statistically rigorous and un- 
ambiguous likelihood estimate for a model of network 
evolution that is quick to compute and does not require 
the generation of test networks for its operation. The 
method requires a target network for which the order 
in which links are added is known (at least approximately)
for a given period of time. Given this data,
a model $\theta$ which purports to explain the evolution can
be compared either with a second model $\theta'$ or with the
null (random) model $\theta_0$ as an explanatory model for
the link and node arrivals observed in the target network.
The likelihood statistics can be efficiently calculated
and could be used, for example, as a fitness function
for a genetic algorithm or in state space exploration
for parametrised models.
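
As a rough illustration of how such a likelihood comparison can be computed without growing any test networks, the following Python sketch scores an ordered list of attachment choices under a candidate model and under the null (random) model. It is a simplified sketch, not the FETA implementation: it assumes every new node makes exactly one link to an existing node, that a model is given by an unnormalised weight on node degree, and that $c_0$ is the per-choice likelihood ratio against the null model. The function names are hypothetical.

```python
import math


def log_likelihoods(seed_degrees, choices, weight):
    """Return (log-likelihood under the model, log-likelihood under the null).

    seed_degrees: dict mapping node id -> degree in the seed network
    choices: ordered list of existing nodes chosen by each arriving node
    weight: function degree -> unnormalised selection weight (assumed model form)
    """
    degrees = dict(seed_degrees)
    next_id = max(degrees) + 1
    ll_model = 0.0
    ll_null = 0.0
    for chosen in choices:
        # Probability the model assigns to the choice that was actually observed.
        total = sum(weight(d) for d in degrees.values())
        ll_model += math.log(weight(degrees[chosen]) / total)
        # The null model picks uniformly among the existing nodes.
        ll_null += math.log(1.0 / len(degrees))
        # Update the network state: the new node links to the chosen node.
        degrees[chosen] += 1
        degrees[next_id] = 1
        next_id += 1
    return ll_model, ll_null


def per_choice_ratio(ll_model, ll_null, n_choices):
    """Geometric mean, per choice, of the likelihood ratio against the null."""
    return math.exp((ll_model - ll_null) / n_choices)


def deviance(log_likelihood):
    """Deviance: lower values correspond to higher likelihoods."""
    return -2.0 * log_likelihood


if __name__ == "__main__":
    seed = {0: 1, 1: 1}      # two nodes joined by one link
    observed = [0, 0, 2, 0]  # nodes chosen by four arriving nodes
    ll_m, ll_0 = log_likelihoods(seed, observed, lambda d: d)
    print(per_choice_ratio(ll_m, ll_0, len(observed)), deviance(ll_m))
```

Under these assumptions a ratio above 1 means the model explains the observed choices better than random selection, and the same function could serve as the fitness function mentioned above.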

A second contribution is a fitting procedure which 
allows the weightings of linear combinations of mod- 
els to be tuned automatically to fit the target network. 
This is an exploratory tool and, in addition to providing 
the weightings which best combine the models chosen, 
can guide the user as to which other model components
might be appropriate for the target network.
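
The idea of tuning the weightings of a linear combination can be sketched as follows. This is not the fitting procedure used in the paper: a plain grid search over a single weight is swapped in for illustration, the model is assumed to be a two-component mixture of a degree-based model and the random model, and each new node is assumed to make exactly one link. All names are hypothetical.

```python
import math


def mixture_log_likelihood(beta, seed_degrees, choices):
    """Log-likelihood of the choices under beta*degree + (1 - beta)*random."""
    degrees = dict(seed_degrees)
    next_id = max(degrees) + 1
    ll = 0.0
    for chosen in choices:
        p_degree = degrees[chosen] / sum(degrees.values())
        p_random = 1.0 / len(degrees)
        # Linear combination of the two component probabilities.
        ll += math.log(beta * p_degree + (1.0 - beta) * p_random)
        # The arriving node links to the chosen node.
        degrees[chosen] += 1
        degrees[next_id] = 1
        next_id += 1
    return ll


def fit_beta(seed_degrees, choices, steps=100):
    """Grid search for the weight in [0, 1] that maximises the likelihood."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda b: mixture_log_likelihood(b, seed_degrees, choices),
    )


if __name__ == "__main__":
    seed = {0: 1, 1: 1}
    print(fit_beta(seed, [0, 0, 0, 0]))  # choices that favour the hub
    print(fit_beta(seed, [1, 2, 3]))     # choices that favour recent nodes
```

Because each component is already a normalised probability, any weights that sum to one give a valid model, which is what makes the linear form convenient to fit.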

Five different networks were tested, and several models
were produced for each. Artificial networks were
grown for each model, and for each one of these a set
of summary statistics was compared against measures
taken from the real target network. Models with better
likelihood estimators were found to have better agreement
with the statistics of the target network. This
confirms that greater accuracy in terms of the likelihood
estimator corresponds to a closer match between
the generated network and the final target network.

A great deal of potential future work arises from this 
paper. The outer model (the part of the model which 
selects whether to add a node or an internal edge) was 
not investigated in any depth. It would be useful to
consider the validation and tuning of more sophisticated
outer models, including ones which also allow node and
edge deletion.

A model form which has more promise than the linear
combination of model components proposed in section
12.21 would be one with multiplicatively combined model
components (that is, a model of the form
$\theta = \theta_1^{\beta_1} \theta_2^{\beta_2} \cdots$).
Logistic regression would seem a promising framework
for this, but nontrivial problems exist with normalisation.
The evaluation framework from section [TT] would,
however, work unchanged with this type of model.
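
To make the normalisation issue concrete, here is a minimal Python sketch of a multiplicatively combined model. It assumes each component supplies a probability for every existing node and that the exponents are given; the function name and data layout are hypothetical. Unlike a linear combination, the raw product does not sum to one, so a normalising constant has to be recomputed whenever the network changes.

```python
def multiplicative_probabilities(component_probs, exponents):
    """Combine component models as a normalised product of powers.

    component_probs: list of dicts, each mapping node id -> probability
    exponents: list of floats, one exponent per component (assumed given)
    """
    raw = {}
    for node in component_probs[0]:
        value = 1.0
        for probs, beta in zip(component_probs, exponents):
            value *= probs[node] ** beta
        raw[node] = value
    # The product of powers is not a probability distribution in general,
    # so it must be rescaled over all nodes.
    norm = sum(raw.values())
    return {node: value / norm for node, value in raw.items()}


if __name__ == "__main__":
    degree_model = {0: 0.5, 1: 0.25, 2: 0.25}
    random_model = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
    print(multiplicative_probabilities([degree_model, random_model], [2.0, 0.0]))
```

The resulting per-node probabilities can be passed to a likelihood calculation in the same way as those of any other model, which is consistent with the remark that the evaluation framework would work unchanged.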

In short, the FETA framework is promising for de- 
velopment in many ways. The evaluation framework 
fits a broad class of models of network evolution and 
could be a very useful tool for researchers wishing to 
test hypotheses. The tools and data used in this paper 
are freely available for download and researchers are
encouraged to try them.